Creating Indexes
An index is a lookup structure created on a table to optimize sorting and
query performance. Indexes are created on a particular column or
columns and store the data values for this column or columns in order.
When raw underlying table data is stored in no particular order, this
situation is referred to as a heap.
The heap is composed of multiple pages, with each page containing
multiple table rows. When raw underlying data is stored in order,
sorted by a column or columns, this situation is referred to as a clustered index.
For example, if you have a table named Customer, with a clustered index
on the FullName column, the rows in this table will be stored in order,
sorted by the full name. This means that when you are searching for a
particular full name, the query optimizer component can execute the
query more efficiently by performing an index lookup rather than a table scan. Only one clustered index is allowed per table; usually this is created on the column designated as the PRIMARY KEY.
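The Customer example above might look roughly like the following sketch. Only the table name and the FullName column come from the text; the other columns, the index name IX_Customer_FullName, and the sample search value are assumptions made for illustration.

```sql
-- Hypothetical Customer table; only Customer and FullName are named in the text.
CREATE TABLE Customer
(CustomerID int NOT NULL,
 FullName varchar(100) NOT NULL,
 Email varchar(100) NULL);
GO
-- Store the table rows in FullName order (assumed index name).
CREATE CLUSTERED INDEX IX_Customer_FullName
ON Customer (FullName);
GO
-- A search on the clustered index key gives the optimizer the option of
-- an index lookup instead of a scan of the whole table.
SELECT CustomerID, FullName, Email
FROM Customer
WHERE FullName = 'Jane Smith';
GO
```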
You can also create additional nonclustered indexes
on a table that is stored either as a heap or as a clustered index. A
nonclustered index is a separate lookup structure that stores index
values in order, and with each index value, it stores a pointer to the
data page containing the row with this index value. Nonclustered
indexes speed up data retrieval, so it makes sense to create them on
the frequently searched columns of a table.
The trade-off with indexes is write performance. Every time a row is
inserted, updated, or deleted, the affected indexes must also be
updated. When writing data to an indexed table, sometimes index pages
have to be split and rearranged to make room for the new values. In
addition, indexes are storage structures that take up disk space.
Indexes are created using the CREATE INDEX statement. Example 1 shows the syntax for creating an index.
Example 1. CREATE INDEX Statement—Syntax
    CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
        ON table_or_view ( column1 [ ASC | DESC ], column2, ...n )
        [ INCLUDE ( additional_column_name, ...n ) ]
        [ WHERE filter_clause ]
        [ WITH ( index_option, ...n ) ]
The CREATE INDEX statement creates a clustered or nonclustered index on a specified column or columns. You can choose to create the index as UNIQUE, which will enforce a UNIQUE constraint on the index columns. A filter_clause
can be specified to create indexes only on a subset of data that meets
specific criteria. This is useful for a very large table, where
creating an index on all values of a particular column will be
impractical. Table 1 summarizes index options that can be used with the CREATE INDEX statement.
Table 1. Index Options
Option | Explanation |
---|---|
PAD_INDEX = ON \| OFF | When this option is ON, the free space specified by the FILLFACTOR parameter is also left in the intermediate-level pages of the index. This allows new values to be inserted without rearranging a large amount of data. When this option is OFF, the intermediate-level pages are filled almost completely, with enough free space for one row reserved in every page during index creation. |
FILLFACTOR = fill factor percentage | Specifies the percentage of each leaf-level page that should be filled with data. For example, a fill factor of 80 means 20% of each page will be empty and available for new data. The fill factor is used only when you create or rebuild an index. Fill factor and index padding are discussed in detail elsewhere. |
SORT_IN_TEMPDB = ON \| OFF | Specifies whether the data should be sorted in the tempdb database instead of the current database. This may give performance advantages if the tempdb database is stored on a different disk from the current database. |
IGNORE_DUP_KEY = ON \| OFF | Specifies what happens when an insert attempts to add duplicate key values to a unique index. When ON, a warning is issued and only the duplicate rows are rejected. When OFF, an error is raised and the whole insert is rolled back. |
STATISTICS_NORECOMPUTE = ON \| OFF | Specifies whether out-of-date optimization statistics for the index are recomputed automatically. When ON, they are not. |
DROP_EXISTING = ON \| OFF | Specifies that the existing index with the same name should be dropped and then re-created. This equates to an index rebuild. |
ONLINE = ON \| OFF | Specifies that the underlying table should remain online and accessible by users while the index is being built. This option is only available in SQL Server 2008 Enterprise or Developer edition. |
ALLOW_ROW_LOCKS = ON \| OFF | Specifies whether row locks are allowed when the index is accessed. |
ALLOW_PAGE_LOCKS = ON \| OFF | Specifies whether page locks are allowed when the index is accessed. |
MAXDOP = max_degree_of_parallelism | Specifies the maximum number of processors that are to be used during the index operation. |
DATA_COMPRESSION = NONE \| ROW \| PAGE | Use data compression at row or page level of the index. |
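As a rough illustration of how the INCLUDE list, the filter_clause, and some of the options in Table 1 can be combined, the sketch below builds a filtered nonclustered index on the Stars table. The index name IX_Star_Type_Heavy and the threshold of 10 solar masses are hypothetical choices, not values from the text.

```sql
-- Assumed example: index only the heavier stars, a subset of the table.
-- Filtered indexes must be nonclustered.
CREATE NONCLUSTERED INDEX IX_Star_Type_Heavy
ON Stars (StarType ASC)
INCLUDE (SolarMass)           -- stored in the index but not part of the key
WHERE SolarMass > 10.00       -- the filter_clause
WITH (FILLFACTOR = 80,        -- leave 20% of each leaf page free
      PAD_INDEX = ON,         -- apply the fill factor to intermediate pages too
      SORT_IN_TEMPDB = ON,
      MAXDOP = 2);
GO
```

Queries that filter on the same predicate, such as a search for star types among stars heavier than 10 solar masses, are the ones most likely to benefit from an index like this.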
Example 2 creates a clustered index (by star name) and a nonclustered index (by
star type) on the Stars table we created in the previous example. Figure 1 shows how Ix_Star_Name can also be created using the interface of SQL Server Management Studio.
Example 2. Working with Indexes
    -- Create the table, specifying that the PRIMARY KEY index is to be created as nonclustered
    CREATE TABLE Stars
    (StarID int PRIMARY KEY NONCLUSTERED,
    StarName varchar(50) UNIQUE,
    SolarMass decimal(10,2) CHECK(SolarMass > 0),
    StarType varchar(50) DEFAULT 'Orange Giant');
    GO
    CREATE CLUSTERED INDEX Ix_Star_Name
    ON Stars(StarName)
    WITH (PAD_INDEX = ON, FILLFACTOR = 70, ONLINE = ON);
    GO
    CREATE NONCLUSTERED INDEX Ix_Star_Type
    ON Stars (StarType)
    WITH (PAD_INDEX = ON, FILLFACTOR = 90);
    GO
When you are creating a PRIMARY KEY constraint, an index on the column(s) designated as PRIMARY KEY
will be created automatically. This index will be clustered by default,
but this can be overridden when creating the constraint by specifying the PRIMARY KEY NONCLUSTERED
option. As a best practice, it is recommended that you accept the
default of the clustered PRIMARY KEY column, unless you have a specific
reason to designate another column as the clustered index key. Usually,
the automatically created index is named PK__TableName__<Unique Number>, but this can be changed at any time by renaming the index. For example, a newly created Stars table with a PRIMARY KEY of StarID automatically has a PRIMARY KEY index with a name such as PK__Stars__06ABC6465F9E293D (the name used in Example 3). Names beginning with UQ, such as UQ__Stars__A4B8A52A5CC1BC92, belong to the indexes created automatically for UNIQUE constraints, such as the one on StarName.
Warning
Remember that when creating a table, a unique index will be automatically created on the columns designated as the PRIMARY KEY.
If you wish to avoid the long rebuild time associated with building a
clustered index, or if you wish to create the clustered index on a
column different from the PRIMARY KEY, you must explicitly specify the PRIMARY KEY NONCLUSTERED option. The PRIMARY KEY will always be unique.
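The sketch below shows one way to follow this advice while avoiding a system-generated index name. The Planets table, its columns, and the constraint and index names are hypothetical and are used only for illustration.

```sql
-- Hypothetical table: the PRIMARY KEY is explicitly named and nonclustered,
-- which leaves the clustered index free for another column.
CREATE TABLE Planets
(PlanetID int NOT NULL
     CONSTRAINT PK_Planets PRIMARY KEY NONCLUSTERED,
 PlanetName varchar(50) NOT NULL);
GO
-- Cluster the table on a column other than the PRIMARY KEY.
CREATE CLUSTERED INDEX IX_Planets_PlanetName
ON Planets (PlanetName);
GO
-- List the indexes on the table to check their names and types.
SELECT name, type_desc, is_primary_key, is_unique_constraint
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'Planets');
GO
```

Naming the constraint up front means that later statements which refer to the index by name do not depend on a generated suffix.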
Working with Full-Text Indexes
Standard indexes are great when used with the simple WHERE clause of the SELECT
statement. An index will greatly reduce the time it will take you to
locate rows where the indexed column is equal to a certain value, or
when this column starts with a certain value. However, standard indexes
are inadequate for fulfilling more complex text-based queries. For
example, creating an index on StarType will not help you find all rows
where the StarType column contains the word “giant,” but not the word
“supermassive”.
To fulfill these types of queries, you must use full-text indexes.
Full-text indexes are complex structures that consolidate the words
used in a column and their relative weight and position, and link these
words with the database page containing the actual data. Full-text
indexes are built using a dedicated component of SQL Server 2008, the Full-Text Engine.
In SQL Server 2005 and earlier, the Full-Text Engine was its own
service, known as full-text search. In SQL Server 2008, the Full-Text
Engine is part of the database engine (running as the SQL Server
Service).
Full-text indexes can be stored on a separate filegroup. This can deliver
performance improvements if this filegroup is hosted on a separate
disk from the rest of the database. Only one full-text index can be
created on a table, although it can cover several columns. The table
must also have a unique index on a single column that does not allow
null values; this index is named as the KEY INDEX when the full-text
index is created. Full-text indexes must be based
on columns of type char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, or varbinary(max).
You must specify a type column when creating a full-text index on an
image, varbinary, or varbinary(max) column. The type column stores the
file extension (.docx, .pdf, .xlsx) of the document stored in the
indexed column.
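A type column might be declared as in the following sketch. The StarDocuments table, its columns, and the StarDocumentCatalog catalog are hypothetical names. Whether a given file extension can actually be indexed depends on the filters installed on the server, which this sketch does not check.

```sql
-- Hypothetical table holding documents as binary data.
CREATE TABLE StarDocuments
(DocumentID int NOT NULL
     CONSTRAINT PK_StarDocuments PRIMARY KEY,   -- single-column, unique, NOT NULL key
 FileExtension varchar(10) NOT NULL,            -- type column, e.g. '.docx' or '.pdf'
 Content varbinary(max) NOT NULL);
GO
CREATE FULLTEXT CATALOG StarDocumentCatalog;
GO
-- TYPE COLUMN tells the Full-Text Engine how to interpret the binary content.
CREATE FULLTEXT INDEX ON StarDocuments
(Content TYPE COLUMN FileExtension)
KEY INDEX PK_StarDocuments
ON StarDocumentCatalog;
GO
```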
Example 3 amends the Stars table to include a Description column and creates a full-text index on this column. The FREETEXT
predicate allows us to search on any of the words specified using the
full-text index. This yields a user experience similar to using an
Internet search engine.
Example 3. Creating and Using a Full-Text Index
    ALTER TABLE Stars
    ADD Description ntext DEFAULT 'No description specified' NOT NULL;
    GO
    CREATE FULLTEXT CATALOG FullTextCatalog AS DEFAULT;
    -- The KEY INDEX name is system generated; use the name of the PRIMARY KEY index in your own database
    CREATE FULLTEXT INDEX ON Stars (Description)
    KEY INDEX PK__Stars__06ABC6465F9E293D;
    GO
    UPDATE Stars SET Description = 'Deneb is the brightest star in the constellation Cygnus and one of the vertices of the Summer Triangle. It is the 19th brightest star in the night sky, with an apparent magnitude of 1.25. A white supergiant, Deneb is also one of the most luminous stars known. It is, or has been, known by a number of other traditional names, including Arided and Aridif, but today these are almost entirely forgotten. Courtesy Wikipedia.'
    WHERE StarName = 'Deneb';
    UPDATE Stars SET Description = 'Pollux, also cataloged as Beta Geminorum, is an orange giant star approximately 34 light-years away in the constellation of Gemini (the Twins). Pollux is the brightest star in the constellation (brighter than Castor (Alpha Geminorum)). As of 2006, Pollux was confirmed to have an extrasolar planet orbiting it. Courtesy Wikipedia.'
    WHERE StarName = 'Pollux';
    GO
    SELECT StarName
    FROM Stars
    WHERE FREETEXT (Description, 'planet orbit, giant');
    GO
    -- Results:
    -- StarName
    -- --------------------------------------------------
    -- Pollux
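The earlier "giant but not supermassive" search could be written with the CONTAINS predicate, which accepts Boolean operators. This is a sketch that assumes the full-text index on the Description column from Example 3 exists and has finished populating. It searches Description rather than StarType because that is the only full-text indexed column in the example.

```sql
-- Rows whose Description contains the word "giant" but not "supermassive".
SELECT StarName
FROM Stars
WHERE CONTAINS (Description, 'giant AND NOT supermassive');
GO
```

CONTAINS matches the words as specified, whereas FREETEXT matches on meaning and treats the search string more loosely.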
Partitioning Data
When working with large databases, query performance often becomes an issue,
even if your indexing strategy is spot-on. If you have decided that
indexing is not enough to produce your desired result, your next step
can be data partitioning. Data partitioning separates a database into
multiple filegroups containing one or more files. These filegroups are
placed on different disks, enabling parallel read and write operations,
which can significantly improve
performance. Approach a partitioning strategy by separating different
tables and indexes into different filegroups and placing them on
separate disks. As a guide, always separate large, frequently accessed
tables that are in a FOREIGN KEY relationship, so that they can be
scanned in parallel when performing a join.
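The filegroup approach might be set up along the following lines. Everything named here is an assumption for illustration: the StarCatalog database, the filegroup and file names, the drive paths, and the two related tables. The paths in particular must be adapted to disks that exist on your server.

```sql
-- Hypothetical database and disk layout: one filegroup per physical disk.
ALTER DATABASE StarCatalog ADD FILEGROUP FG_Constellations;
ALTER DATABASE StarCatalog ADD FILEGROUP FG_Observations;
GO
ALTER DATABASE StarCatalog ADD FILE
(NAME = N'StarCatalog_Constellations',
 FILENAME = N'E:\SQLData\StarCatalog_Constellations.ndf',
 SIZE = 100MB)
TO FILEGROUP FG_Constellations;
ALTER DATABASE StarCatalog ADD FILE
(NAME = N'StarCatalog_Observations',
 FILENAME = N'F:\SQLData\StarCatalog_Observations.ndf',
 SIZE = 100MB)
TO FILEGROUP FG_Observations;
GO
USE StarCatalog;
GO
-- Two tables in a FOREIGN KEY relationship, placed on different filegroups
-- so that a join can read from both disks at the same time.
CREATE TABLE Constellations
(ConstellationID int NOT NULL,
 ConstellationName varchar(50) NOT NULL,
 CONSTRAINT PK_Constellations PRIMARY KEY (ConstellationID))
ON FG_Constellations;
GO
CREATE TABLE Observations
(ObservationID int NOT NULL,
 ConstellationID int NOT NULL,
 ObservedOn datetime NOT NULL,
 CONSTRAINT PK_Observations PRIMARY KEY (ObservationID),
 CONSTRAINT FK_Observations_Constellations
     FOREIGN KEY (ConstellationID) REFERENCES Constellations (ConstellationID))
ON FG_Observations;
GO
```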
If the desired performance is not achieved by simple partitioning, this is
usually due to very large single tables. You can employ a horizontal or vertical partitioning
technique to split a single large table into multiple smaller tables.
Queries that access this table will run faster, and performance of
maintenance tasks, such as backup and index rebuild, will also be
improved.
Horizontal Partitioning
Horizontal partitioning splits a table into several smaller tables by separating
out clusters of rows, based on a partitioning function. The structure
of the smaller tables will remain the same as the structure of the
initial table, but the smaller tables will contain fewer rows. For
example, if you have a very large table that has 100 million rows, you
can partition it into 10 tables containing 10 million rows each. Date
columns are often a good choice for horizontal partitioning. For
example, a table could be partitioned historically by year, with each year
stored in a smaller table. Thus, if a query requires data for specific
dates, only one smaller table needs to be scanned.
Analyze the data and how your users are accessing this data in order to derive
the best horizontal partitioning strategy. Aim to partition the tables
so that the majority of the queries can be satisfied from as few
smaller tables as possible. To combine rows from several smaller tables, UNION queries are required, and these can degrade performance.
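One way to picture the by-year split is a pair of smaller tables combined by a view. This is a sketch under assumptions: the Orders2007 and Orders2008 tables, their columns, and the AllOrders view are hypothetical names, not objects from the text.

```sql
-- Hypothetical smaller tables, one per year, with identical structure.
-- The CHECK constraints record which rows each table is allowed to hold.
CREATE TABLE Orders2007
(OrderID int NOT NULL,
 OrderDate datetime NOT NULL
     CONSTRAINT CK_Orders2007_OrderDate
     CHECK (OrderDate >= '20070101' AND OrderDate < '20080101'),
 Amount decimal(10,2) NOT NULL,
 CONSTRAINT PK_Orders2007 PRIMARY KEY (OrderID, OrderDate));
CREATE TABLE Orders2008
(OrderID int NOT NULL,
 OrderDate datetime NOT NULL
     CONSTRAINT CK_Orders2008_OrderDate
     CHECK (OrderDate >= '20080101' AND OrderDate < '20090101'),
 Amount decimal(10,2) NOT NULL,
 CONSTRAINT PK_Orders2008 PRIMARY KEY (OrderID, OrderDate));
GO
-- A query spanning years has to combine the smaller tables.
-- UNION ALL is used because the date ranges do not overlap.
CREATE VIEW AllOrders
AS
SELECT OrderID, OrderDate, Amount FROM Orders2007
UNION ALL
SELECT OrderID, OrderDate, Amount FROM Orders2008;
GO
-- A query for specific dates only needs one of the smaller tables.
SELECT OrderID, Amount
FROM Orders2008
WHERE OrderDate >= '20080601' AND OrderDate < '20080701';
GO
```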
Vertical Partitioning
Unlike horizontal partitioning, vertical partitioning separates different
columns of a single table into multiple tables. The resultant smaller
tables have the same number of rows as the initial table, but the
structure is different. Two types of vertical partitioning are
available:
Normalization
Normalization is the process of applying logical database design
techniques to reduce data duplication. This is achieved mainly by
identifying logical relationships within your data and implementing
multiple tables related by FOREIGN KEY constraints.
Row splitting
This technique separates some columns from a larger table into another
table or tables. Essentially, each logical row in a table partitioned
using row splitting is stored across two tables. To maintain integrity
between the tables, use a FOREIGN KEY constraint in which both the PRIMARY KEY column and the FOREIGN KEY column are unique. This is known as a one-to-one relationship.
If implemented correctly, vertical partitioning reduces the time it takes
to scan data. Use row splitting to separate frequently used and rarely
accessed columns into separate tables, so that common queries do not
carry the overhead of reading rarely needed columns. The
drawback of vertical partitioning is the processing time and resources
it takes to perform the joins, when needed.
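Row splitting could be sketched as follows. The StarSummary and StarDetail tables and their columns are hypothetical. The point is only that the second table's key is both its PRIMARY KEY and a FOREIGN KEY to the first, which gives the one-to-one relationship described above.

```sql
-- Frequently used columns stay in a narrow table.
CREATE TABLE StarSummary
(StarID int NOT NULL,
 StarName varchar(50) NOT NULL,
 StarType varchar(50) NOT NULL,
 CONSTRAINT PK_StarSummary PRIMARY KEY (StarID));
GO
-- Rarely accessed columns move to a second table. StarID is unique here
-- (PRIMARY KEY) and must exist in StarSummary (FOREIGN KEY): one-to-one.
CREATE TABLE StarDetail
(StarID int NOT NULL,
 Description nvarchar(max) NULL,
 CONSTRAINT PK_StarDetail PRIMARY KEY (StarID),
 CONSTRAINT FK_StarDetail_StarSummary
     FOREIGN KEY (StarID) REFERENCES StarSummary (StarID));
GO
-- The full logical row is reassembled with a join only when it is needed.
SELECT s.StarID, s.StarName, s.StarType, d.Description
FROM StarSummary AS s
LEFT JOIN StarDetail AS d ON d.StarID = s.StarID;
GO
```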